Avoid retry on non-recoverable errors.#101
Avoid retry on non-recoverable errors.#101StevenYCChou wants to merge 7 commits intoStackdriver:masterfrom
Conversation
| func (b *sampleBuilder) seriesGetWithRetry(ctx context.Context, sample tsdb.RefSample) (*seriesCacheEntry, bool) { | ||
| backoff := time.Duration(0) | ||
| entry, ok, err := b.series.get(ctx, sample.Ref) | ||
| for { |
There was a problem hiding this comment.
This doesn't seem to honor ctx.Done anymore. Is that intended?
There was a problem hiding this comment.
now ctx.Done() is captured in this function. If context cancellation happens, it will at least be caught here (though if ctx.Err() is thrown inside series.get(), it won't immediately handle it. It needs to wait until next iteration to check ctx.Done()).
I didn't re-check ctx.Done() after calling series.geT() is because I think checking ctx.Done() is better at the top of the for-loop for simplicity and readability.
|
Can you include the rationale for this change? |
|
Still work in progress - as I haven't provided rationale yet. |
|
@jkohen ready for review. please take a look on the new change. The rationale is updated in 1st comment of the PR. |
retrieval/transform.go
Outdated
| if err == nil { | ||
| break | ||
| } | ||
| if _, ok := err.(unknownMetricError); ok { |
There was a problem hiding this comment.
I'm not sure about adding such specific error taxonomy. We may end up with too many conditionals. I would ask what could lead to different handling. Is it just about recoverable and permanent? Could we call it recoverable, as in other parts of the code?
Only retry when calling seriesCache.get().
For error caused by unexpected metric name suffix or by unexpected metric type, it should return as unrecoverable error because retrying doesn't help the situation.
|
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
8d5cbec to
027c2bf
Compare
|
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
Rationale behind this change:
in
retrieval.manager(), retry with exponential backoff is used when building Stackdriver sample. However, there are 2 types of errors which will always be thrown, and retrying wouldn't help:stackdriver-prometheus-sidecar/retrieval/transform.go
Line 96 in bfe3938
stackdriver-prometheus-sidecar/retrieval/transform.go
Line 114 in bfe3938
This goal of this change is to avoid retrying on these 2 types of errors. The implementation is to only retry on recoverable errors.
When I implemented this, I also found that
seriesCache.refresh()will potentially throw errors which is mostly not recoverable, hence making sure we wouldn't retry on non-recoverable errors.